Non-uniform Sampling in Clustering and Streaming

Author

  • Morteza Monemizadeh
Abstract

Approximating a sum without computing every summand is a classic problem in statistics and machine learning. The problem is defined as follows: let $Z$ be the sum of $n$ numbers $Z_1, \dots, Z_n$, i.e., $Z = Z_1 + \dots + Z_n$. The goal is to estimate $Z$ while computing only a few of the $n$ summands. Under uniform sampling we choose a number $Z_i$ with probability $1/n$ and assign it the weight $n$; the number $nZ_i$ is then our estimate of $Z$. The expectation of the random variable $nZ_i$ is $Z$, but its variance can be large: when only a few of the numbers are large, the random sample misses all of them with high probability, and the estimate is then far from $Z$. Using non-uniform sampling we can bound the variance in terms of the expectation and therefore estimate $Z$ within a factor of $(1 \pm \epsilon)$ as follows. Given $n$ probabilities $r_i \ge \frac{1}{\gamma} \cdot \frac{Z_i}{Z}$ for $1 \le i \le n$, corresponding to the numbers $Z_1, \dots, Z_n$, we take a sample set $A = \{a_1, \dots, a_j, \dots, a_s\} \subseteq [n]$ of indices according to the probabilities $r_i$ and assign the weight $w(Z_{a_j}) = \frac{1}{s \cdot r_{a_j}}$ to each sampled number $Z_{a_j}$, $1 \le j \le s$. Concentration bounds then show that for $s = O(\gamma \epsilon^{-2} \log(1/\delta))$ the probability that the estimator $X = \sum_{a_j \in A} w(Z_{a_j}) \cdot Z_{a_j}$ deviates from $Z$ by more than $\epsilon Z$ is at most $\delta$.

In this thesis we study applications of this estimator in high-dimensional clustering and streaming. In particular, for the k-means and j-subspace problems we obtain unbiased estimators that $(1 \pm \epsilon)$-approximate the cost of the point set with respect to an arbitrary center set. We then use these estimators to obtain coresets, linear-time $(1 + \epsilon)$-approximation algorithms, and insertion-only streaming algorithms.

In the turnstile streaming model we are given a vector $a$ of length $n$, whose $i$-th coordinate is denoted $a_i$, and a stream $S$ of $m = \mathrm{poly}(n, M)$ updates of the form $(i, x)$, where $i \in [n]$ and $x \in \{-M, -M+1, \dots, M-1, M\}$, indicating that the $i$-th coordinate $a_i$ of $a$ should be incremented by $x$. Let $Z_i = |a_i|^p$ for $p \in [0, 2]$, $1 \le i \le n$, and $Z = F_p(a) = \sum_{i=1}^{n} |a_i|^p$. In this model, finding $n$ probabilities $r_i \ge \frac{1}{\gamma} \cdot \frac{Z_i}{Z}$ using one pass and polylogarithmic space was known to be an open problem in the streaming community [CMI05]. We give a 1-pass, $\mathrm{poly}(\epsilon^{-1} \log n)$-space algorithm, called the $L_p$-sampler, that samples according to probabilities $r_i$ for $\gamma = (1 \pm \epsilon)$ and $p \in [0, 2]$. We show that the $L_p$-sampler leads to many improvements and a unification of well-studied streaming problems, including cascaded norms, heavy hitters, and moment estimation. In particular, for moment estimation with $k > 2$, running $O(n^{1-2/k} \epsilon^{-2})$ $L_2$-samplers in parallel lets us $(1 \pm \epsilon)$-estimate $F_k(a) = \sum_{i=1}^{n} |a_i|^k$ using optimal space $n^{1-2/k} \cdot \mathrm{poly}(\epsilon^{-1} \log n)$. This algorithm is the first that does not use Nisan's pseudorandom generator as a subroutine, potentially making it more practical.
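
The non-uniform estimator described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and several pieces are assumptions made for the example: indices are drawn i.i.d. with replacement; the probabilities r_i are built by the hypothetical helper mixed_probabilities, which reads every value and so only serves to show the condition r_i >= Z_i / (gamma * Z) (a real application would obtain such probabilities without knowing Z); and sample_size uses the constant 3 from a standard multiplicative Chernoff bound, whereas the abstract only states the O(.) form.

```python
import math
import random


def sample_size(gamma, eps, delta):
    """Sample count sufficient under a multiplicative Chernoff bound.

    The constant 3 and the log(2/delta) term are one standard choice
    (valid for 0 < eps <= 1); they are assumptions of this sketch.
    """
    return math.ceil(3.0 * gamma * math.log(2.0 / delta) / eps ** 2)


def mixed_probabilities(values, gamma):
    """Return r_i = (1/gamma) * Z_i/Z + (1 - 1/gamma)/n for gamma >= 1.

    These sum to 1 and satisfy r_i >= Z_i / (gamma * Z). The total Z is
    computed exactly here purely for illustration.
    """
    n = len(values)
    total = sum(values)
    return [z / (gamma * total) + (1.0 - 1.0 / gamma) / n for z in values]


def estimate_sum(values, probs, s, rng):
    """Draw s indices i.i.d. from probs; return X = sum_j Z_{a_j} / (s * r_{a_j})."""
    indices = rng.choices(range(len(values)), weights=probs, k=s)
    return sum(values[i] / (s * probs[i]) for i in indices)


if __name__ == "__main__":
    rng = random.Random(0)
    # 990 small numbers and 10 large ones: the large ones dominate the sum.
    values = [1.0] * 990 + [10000.0] * 10
    n = len(values)

    uniform = [1.0 / n] * n
    nonuniform = mixed_probabilities(values, gamma=2.0)
    s = sample_size(gamma=2.0, eps=0.1, delta=0.05)

    print("true sum:            ", sum(values))
    print("uniform, 50 samples: ", estimate_sum(values, uniform, 50, rng))
    print("non-uniform, s =", s, ":", estimate_sum(values, nonuniform, s, rng))
```

With uniform probabilities, 50 draws miss all ten large values with probability 0.99^50, roughly 0.6, and the estimate then equals n times the small value. With the mixed probabilities every term of the sum is at most gamma * Z / s, which is what keeps the variance bounded in terms of the expectation.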

Similar resources

StreamKM++: A Clustering Algorithm for Data Streams

We develop a new k-means clustering algorithm for data streams, which we call StreamKM++. Our algorithm computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm [1]. To compute the small sample, we propose two new techniques. First, we use a non-uniform sampling approach similar to the k-means++ seeding procedure to obtain small core...

On the Impact of Class Imbalance in GP Streaming Classification with Label Budgets

Streaming data scenarios introduce a set of requirements that do not exist under the supervised learning paradigms typically employed for classification. Specific examples include anytime operation, non-stationary processes, and limited label budgets. From the perspective of class imbalance, this implies that it is not even possible to guarantee that all classes are present in the samples of data ...

StreamKM++: A Clustering Algorithm for Data Streams

We develop a new k-means clustering algorithm for data streams of points from a Euclidean space. We call this algorithm StreamKM++. Our algorithm computes a small weighted sample of the data stream and solves the problem on the sample using the k-means++ algorithm of Arthur and Vassilvitskii (SODA '07). To compute the small sample, we propose two new techniques. First, we use an adaptive, non-u...

Streaming Algorithms for k-Center Clustering with Outliers and with Anonymity

Clustering is a common problem in the analysis of large data sets. Streaming algorithms, which make a single pass over the data set using small working memory and produce a clustering comparable in cost to the optimal offline solution, are especially useful. We develop the first streaming algorithms achieving a constant-factor approximation to the cluster radius for two variations of the k-cent...

Sublinear Algorithms for MAXCUT and Correlation Clustering

We study sublinear algorithms for two fundamental graph problems, MAXCUT and correlation clustering. Our focus is on constructing core-sets as well as developing streaming algorithms for these problems. Constant space algorithms are known for dense graphs for these problems, while Ω(n) lower bounds exist (in the streaming setting) for sparse graphs. Our goal in this paper is to bridge the gap b...

Publication date: 2011